HIVE-29544: Fix Vectorized Parquet reading Struct columns with all fields null#6408
ayushtkn wants to merge 2 commits into apache:master
Conversation
@ayushtkn: this is a very valuable and important correctness patch; I left some comments.
I also ran the qtest without the patch, and I can confirm the patch works.
I tried to pick up the definition level concept (again), and I more or less understand this patch.
I'm inclined to accept it if you cover the rest of the use cases I mentioned.
if (i % 3 != 0) {
some_null_g.append("g", doubleVal);
if (i % 6 != 0) {
Group some_null_g = group.addGroup("struct_field_some_null");
as you're already touching this part, please fix some_null_g variable naming
Ouch, I missed it; let me fix it to struct_field_with_null in the next iteration.
-- Test D: Verify field-level null evaluation inside a valid struct
SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL AND st_prim.x IS NULL;

-- Validate without vectorization
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=false;
isn't this simpler and closer to the stated intent "without vectorization":
SET hive.vectorized.execution.enabled=false;
does the same problem apply to other complex types? It would be better to have more test cases and maybe even rename the qfile to parquet_complex_..., maybe parquet_complex_types_null_vectorization.q, to clearly distinguish it from the already existing parquet_complex_types_vectorization.q
e.g.:
LIST:
NULL vs. [null, null] or [1, null]
or MAP ("same" as struct but not fixed schema)
NULL vs. { "a": 1, "b": null } and { "a": 1, "b": null, "c": null }
or even nested struct to validate the logic on deeper recursive callpaths:
CREATE TABLE test_parquet_nested_struct_nulls (
id INT,
st_prim STRUCT<x:INT, y:INT>,
st_nested STRUCT<x:INT, y:STRUCT<v:INT, w:INT>>
) STORED AS PARQUET;
and the nested data can contain NULL on different levels, where your patch is actually hit I assume, e.g.:
NULL
{x: 1, y: NULL}
{x: 1, y: {v: 2, w: NULL}}
{x: 1, y: {v: 2, w: 3}}
I would appreciate a lot if this patch could validate all of these
The struct value turning to NULL when its sub-fields are NULL was due to specific code in VectorizedStructReader:
it used to check whether all fields were null and, if so, set the struct itself to null; that is what I fixed.
I checked LIST & MAP. They have another bug :-) In LIST & MAP,
if you insert NULL into them it defaults to 0,
e.g.:
CREATE TABLE test_parquet_array_nulls (
  id INT,
  arr_prim ARRAY<INT>
) STORED AS PARQUET;

INSERT INTO test_parquet_array_nulls VALUES
-- 1: Array exists, but all elements inside are NULL
(1, array(CAST(NULL AS INT), CAST(NULL AS INT))),
-- 2: The Array itself is strictly NULL
(2, if(1=0, array(1, 2), null)),
-- 3: Array exists, containing a mix of valid and NULL elements
(3, array(3, CAST(NULL AS INT))),
-- 4: Array exists, all elements are valid
(4, array(4, 5));

SELECT * FROM test_parquet_array_nulls ORDER BY id;
It outputs
+------------------------------+------------------------------------+
| test_parquet_array_nulls.id | test_parquet_array_nulls.arr_prim |
+------------------------------+------------------------------------+
| 1 | [0,0] |
| 2 | NULL |
| 3 | [3,0] |
| 4 | [4,5] |
+------------------------------+------------------------------------+
Disabling vectorization gives the correct output:
+------------------------------+------------------------------------+
| test_parquet_array_nulls.id | test_parquet_array_nulls.arr_prim |
+------------------------------+------------------------------------+
| 1 | [null,null] |
| 2 | NULL |
| 3 | [3,null] |
| 4 | [4,5] |
+------------------------------+------------------------------------+
This seems like a different bug, where every NULL is treated as 0.
Would it be ok if we chase this in a different ticket? I believe somewhere it is returning the default int value instead of NULL; some check is wrong, which I have to debug.
Regarding nested structs, vectorization is disabled for them, so this code path doesn't kick in:
https://issues.apache.org/jira/browse/HIVE-19016
For map we already have a test: https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/parquet_map_null_vectorization.q
There was a problem hiding this comment.
For the Map there is a test, but the expected output is wrong: it is testing for null, but the output has 0:
https://github.com/apache/hive/blame/13f3208c01fec2d20108302efc3fd033d1d76a19/ql/src/test/results/clientpositive/llap/parquet_map_null_vectorization.q.out#L153-L158
It should have been
+-----+---------------+
| id | intmap |
+-----+---------------+
| 1 | {1:null,2:3} |
| 2 | NULL |
+-----+---------------+
The inserts were
insert into parquet_map_type_int SELECT 1, MAP(1, null, 2, 3)
insert into parquet_map_type_int (id) VALUES (2)
Let me know if you are ok with chasing this separately.
ColumnVector column,
TypeInfo columnType) throws IOException {
this.currentDefLevels = new int[total];
this.defLevelIndex = 0;
can you elaborate on this change? I mean, I can see that passing the definition levels to the struct reader and handling them there solves the current issue, but this looks like it also fixes another bug; is that the case?
Because I removed the old, flawed logic that incorrectly ANDed the child isNull flags (which was the root cause of the bug for structs with all-null fields), the struct reader needs a reliable way to know when the struct itself is actually null. Fetching the definition levels from the primitive reader isn't a separate fix; it is the replacement mechanism for the deleted logic. It is the only correct way in Parquet to distinguish between an explicitly NULL struct and a valid struct containing null fields.
If you run the test and remove these two lines, the 2nd inserted NULL won't evaluate correctly.
If we don't initialize that array, the primitive reader skips recording the definition levels, and getDefinitionLevels() returns null. The struct reader then bypasses the definition-level evaluation block entirely. Because the old flawed AND logic is gone and the new definition-level logic is bypassed, the struct vector defaults to isNull = false. This causes a genuinely NULL struct (row 2) to incorrectly evaluate as an existing struct with null fields: {"x":null,"y":null}.
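To make the mechanism concrete, here is a minimal standalone sketch (not Hive's actual reader; the schema, constant, and method names are illustrative assumptions): for an optional struct with one optional int field, the leaf's maximum definition level is 2 and the struct's own definition level is 1, so a recorded leaf definition level below 1 means the struct itself is null for that row.

```java
// Sketch: how Parquet definition levels distinguish a NULL struct
// from a present struct whose field is null.
// Assumed schema: optional group st { optional int32 x; }
//   leaf x max def level = 2, the struct's own def level = 1.
public class DefLevelDemo {
    static final int STRUCT_DEF_LEVEL = 1;

    // A leaf definition level below the struct's level means the
    // struct itself is null for that row.
    static boolean structIsNull(int leafDefLevel) {
        return leafDefLevel < STRUCT_DEF_LEVEL;
    }

    public static void main(String[] args) {
        // row 1: st = NULL       -> def level 0
        // row 2: st = {x: null}  -> def level 1
        // row 3: st = {x: 1}     -> def level 2
        int[] defLevels = {0, 1, 2};
        for (int j = 0; j < defLevels.length; j++) {
            System.out.println("row " + (j + 1) + " struct null? " + structIsNull(defLevels[j]));
        }
    }
}
```

Only the def level 0 row collapses the struct to NULL; def level 1 keeps the struct present with a null field, which is exactly the distinction the removed AND logic could not make.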
SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NULL;

-- Test C: Verify IS NOT NULL evaluates correctly
SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL ORDER BY id;
instead of ORDER BY, what about:
-- SORT_QUERY_RESULTS
-- Validate without vectorization
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=false;
SELECT * FROM test_parquet_struct_nulls ORDER BY id;
if (defLevels != null) {
for (int j = 0; j < total; j++) {
if (defLevels[j] < structDefLevel) {
// The D-Level boundary crossed the struct. The whole struct is null.
nit: is D-Level a well-known abbreviation? If so, it's fine; otherwise we can use Definition Level, similarly to a comment above
}
if (i % 3 != 0) {
some_null_g.append("g", doubleVal);
if (i % 6 != 0) {
even though I like the applied math here (i % 6 != 0 is i % 2 != 0 || i % 3 != 0 with De Morgan's law applied, which prevents an unnecessary group.addGroup call), I feel we can afford some verbosity in the unit test to make it easier to read
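The equivalence the comment relies on can be checked in a standalone sketch (nothing Hive-specific): i % 6 != 0 and i % 2 != 0 || i % 3 != 0 agree for every i, so spelling the condition out changes only readability, not behavior.

```java
// Verifies that i % 6 != 0 is equivalent to i % 2 != 0 || i % 3 != 0,
// i.e. De Morgan's law applied to "divisible by both 2 and 3".
public class DeMorganCheck {
    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            boolean compact = i % 6 != 0;
            boolean verbose = i % 2 != 0 || i % 3 != 0;
            if (compact != verbose) {
                throw new AssertionError("mismatch at i=" + i);
            }
        }
        System.out.println("equivalent for 0..999");
    }
}
```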
for (VectorizedColumnReader reader : fieldReaders) {
  defLevels = reader.getDefinitionLevels();
  if (defLevels != null) {
    break;
  }
}
this seems like repeated logic; could it be handled inside getDefinitionLevels?
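The suggested deduplication could look like the following sketch, with the scan hoisted into one helper. Since VectorizedColumnReader isn't available outside Hive, a generic supplier list stands in for the field readers; all names here are assumptions, not the Hive API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class DefLevelLookup {
    // Returns the first non-null definition-level array produced by the
    // given readers, or null if none of them recorded levels.
    static int[] firstNonNullDefLevels(List<Supplier<int[]>> readers) {
        for (Supplier<int[]> reader : readers) {
            int[] levels = reader.get();
            if (levels != null) {
                return levels;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Supplier<int[]>> readers = Arrays.asList(
            () -> null,                // a reader that recorded no levels
            () -> new int[]{0, 1, 2},  // first reader with recorded levels
            () -> new int[]{9}         // not reached
        );
        System.out.println(Arrays.toString(firstNonNullDefLevels(readers)));
    }
}
```

Each call site then asks the helper once instead of repeating the loop.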
Fix Vectorized Parquet reading Struct columns with all fields null

What changes were proposed in this pull request?
During vectorized Parquet reads, the struct column shouldn't be read as null when only its fields are null; the column should be read as null only if the struct value itself is null.

Why are the changes needed?
To sync behaviour between vectorized & non-vectorized reads.

Does this PR introduce any user-facing change?
Yes, vectorized reads of Struct columns with all field values null no longer read the column as null.

How was this patch tested?
UT